Abstract

Unsupervised discovery of latent representa- 
tions, in addition to being useful for den- 
sity modeling, visualisation and exploratory 
data analysis, is also increasingly important 
for learning features relevant to discrimina- 
tive tasks. Autoencoders, in particular, have 
proven to be an effective way to learn la- 
tent codes that reflect meaningful variations 
in data. A continuing challenge, however, is 
guiding an autoencoder toward representa- 
tions that are useful for particular tasks. A 
complementary challenge is to find codes that 
are invariant to irrelevant transformations of 
the data. The most common way of intro- 
ducing such problem-specific guidance in au- 
toencoders has been through the incorpora- 
tion of a parametric component that ties the 
latent representation to the label informa- 
tion. In this work, we argue that a prefer- 
able approach relies instead on a nonpara- 
metric guidance mechanism. Conceptually, 
it ensures that there exists a function that 
can predict the label information, without ex- 
plicitly instantiating that function. The su- 
periority of this guidance mechanism is con- 
firmed on two datasets. In particular, this 
approach is able to incorporate invariance in- 
formation (lighting, elevation, etc.) from the 
small NORB object recognition dataset and 
yields state-of-the-art performance for a sin- 
gle layer, non-convolutional network. 



1 Introduction 

The inference of constrained latent representations 
plays a key role in machine learning and probabilis- 
tic modeling. Broadly, the idea is that discovering a 
compressed representation of the data will correspond 
to determining what is important and unimportant 
about the data. One can also view constrained latent
representations as providing features that can be
used to solve other machine learning tasks. Of particular
importance are methods for latent representation
that can efficiently construct codes for out-of-sample
data, enabling rapid feature extraction. Neural networks,
for example, provide such feed-forward feature
extractors, and autoencoders, specifically, have found
use in domains such as image classification (Vincent
et al. 2008), speech recognition (Deng et al. 2010) and
Bayesian nonparametric models (Adams et al. 2010).



While the representations learned with autoencoders
are often useful for discriminative tasks, they require
that the salient variations in the data distribution be
relevant for labeling. This is not necessarily the case;
as irrelevant factors of variation grow in importance
and increasingly dominate the input distribution,
the representation extracted by autoencoders
tends to become less useful (Larochelle et al. 2007).
To address this issue, Bengio et al. (2007) introduced
mild supervised guidance into the autoencoder training
objective, by adding connections from the hidden
layer to output units predicting label information
(those connections are equivalent to the parameters
of a logistic regression classifier). The same approach
was followed by Ranzato and Szummer (2008) to learn
compact representations of documents.

One downside of this approach is that it potentially 
complicates the task of learning the autoencoder rep- 
resentation. Indeed, it now tries to solve two addi- 
tional problems: find a hidden representation from 
which the label information can be predicted and track 
the parametric value of that predictor (i.e. the logis- 
tic regression weights) throughout learning. However, 
we are only interested in the first problem (increased 
predictability of the label). The actual parametric 
value of the label predictor is not important. Once the 
autoencoder is trained, the label predictor can easily 
be found by training a logistic regressor from scratch, 
keeping the hidden layer fixed. We might even want
to use a classifier that is very different from the logistic
regression classifier for which the hidden layer has
been trained.

In this work, we investigate this issue and explore a
different approach to introducing supervised guidance.
We treat the latent space of the autoencoder as also
being the latent space for a Gaussian process latent
variable model (GPLVM) (Lawrence 2005). The discriminative
labels are then taken to belong to the visible
space of the GPLVM. The end result is a nonparametrically
guided autoencoder which combines an
efficient feed-forward parametric encoder/decoder with
the Bayesian nonparametric inclusion of label informa- 
tion. We discuss how this corresponds to marginalizing 
out the parameters of a mapping from the latent rep- 
resentation to the label and show experimentally how 
this approach is preferable to explicitly instantiating 
such a parametric mapping. Finally, we show how this 
hybrid model also provides a way to guide the autoen- 
coder's representation away from irrelevant features to 
which the encoding should be invariant. 

2 Unsupervised Learning of Latent 
Representations 

We first review the two different latent representation 
learning algorithms on which this work builds. We 
then discuss a relationship between the two that pro- 
vides part of the motivation for the proposed nonpara- 
metrically guided autoencoder. 

2.1 Autoencoder Neural Networks 



Our starting point is the autoencoder (Cottrell et al.
1987), which is an artificial neural network that is
trained to reproduce (or reconstruct) the input at its
output. Its computations are decomposed into two
parts: the encoder, which computes a latent (often
lower-dimensional) representation of the input, and
the decoder, which reconstructs the original input from
its latent representation. We denote the latent space
by $\mathcal{X}$ and the visible (data) space by $\mathcal{Y}$. We assume
that these spaces are real-valued with dimension $J$
and $K$, respectively, i.e., $\mathcal{X} = \mathbb{R}^J$ and $\mathcal{Y} = \mathbb{R}^K$. We denote
the encoder, then, as a function $g(y; \phi) : \mathcal{Y} \to \mathcal{X}$
and the decoder as $f(x; \psi) : \mathcal{X} \to \mathcal{Y}$. With a set of $N$
examples $\mathcal{D} = \{y^{(n)}\}_{n=1}^N$, $y^{(n)} \in \mathcal{Y}$, we jointly optimize
the encoder parameters $\phi$ and decoder parameters $\psi$
for the least-squares reconstruction cost:

$$\phi^\star, \psi^\star = \arg\min_{\phi, \psi} \sum_{n=1}^N \sum_{k=1}^K \left( y_k^{(n)} - f_k(g(y^{(n)}; \phi); \psi) \right)^2 \qquad (1)$$



where $f_k(\cdot)$ is the $k$th output dimension of $f(\cdot)$.
Autoencoders have become popular as a module for
"greedy pre-training" of deep neural networks (Bengio
et al., 2007). In particular, the denoising autoencoder
of Vincent et al. (2008) is effective at learning overcomplete
latent representations, i.e., codes of higher
dimensionality than the input. Overcomplete representations
are thought to be ideal for discriminative tasks,
but are difficult to learn due to trivial "identity"
solutions to the autoencoder objective. This problem is
circumvented in the denoising autoencoder by providing
as input a corrupted training example, while evaluating
reconstruction on the noiseless original. With
this objective, the autoencoder learns to leverage the
statistical structure of the inputs to extract a richer
latent representation.
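
The denoising objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the rectifier nonlinearity, the additive Gaussian corruption, the untied weight matrices and all function and variable names are assumptions made for the example.

```python
import numpy as np

def encode(Y, W_enc, b):
    """Encoder g(y; phi): linear map followed by a rectifier (assumed nonlinearity)."""
    return np.maximum(0.0, Y @ W_enc + b)

def decode(X, W_dec, c):
    """Linear decoder f(x; psi)."""
    return X @ W_dec + c

def denoising_cost(Y, params, noise_std, rng):
    """Eq. (1) evaluated with corrupted inputs and clean reconstruction targets."""
    W_enc, b, W_dec, c = params
    # Assumed corruption: additive zero-mean Gaussian noise on the inputs.
    Y_noisy = Y + noise_std * rng.standard_normal(Y.shape)
    recon = decode(encode(Y_noisy, W_enc, b), W_dec, c)
    # The error is measured against the noiseless original Y.
    return float(np.sum((Y - recon) ** 2))

rng = np.random.default_rng(0)
N, K, J = 5, 4, 8  # J > K: an overcomplete code
Y = rng.standard_normal((N, K))
params = (0.1 * rng.standard_normal((K, J)), np.zeros(J),
          0.1 * rng.standard_normal((J, K)), np.zeros(K))
cost = denoising_cost(Y, params, noise_std=0.05, rng=rng)
```

Because the target is the clean input while the encoder only sees the corrupted one, copying the input through an overcomplete code no longer minimizes the cost.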

2.2 Gaussian Process Latent Variable Models 

One alternative approach to the learning of latent rep- 
resentations is to consider a lower-dimensional mani- 
fold that reflects the statistical structure of the data. 
Such manifolds may be difficult to directly define, how- 
ever, and so many approaches to latent coding frame 
the problem indirectly by specifying distributions on 
functions between the visible and latent spaces. The 
Gaussian process latent variable model (GPLVM) of
Lawrence (2005) takes a Bayesian probabilistic approach
to this and constructs a distribution over mapping
functions using a Gaussian process (GP) prior.
The GPLVM results in a powerful nonparametric
model that analytically marginalizes over the infinite
number of possible mappings from the latent to the
visible space. While initially used for visualization of
high dimensional data, GPLVMs have achieved
state-of-the-art results for a number of tasks, including
modeling human motion (Wang et al. 2008), classification
(Urtasun and Darrell 2007) and collaborative filtering
(Lawrence and Urtasun 2009).



As in the autoencoder, the GPLVM assumes that
the $N$ observed data $\mathcal{D} = \{y^{(n)}\}_{n=1}^N$ are the image
of a homologous set $\{x^{(n)}\}_{n=1}^N$, arising from
a vector-valued "decoder" function $f(x) : \mathcal{X} \to \mathcal{Y}$.
Analogously to the squared-loss of the previous
section, the GPLVM assumes that the observed
data have been corrupted by zero-mean Gaussian
noise: $y^{(n)} = f(x^{(n)}) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_K)$. The
innovation of the GPLVM is to place a Gaussian process
prior on the function $f(x)$ and then optimize the latent
representation $\{x^{(n)}\}_{n=1}^N$, while marginalizing out
the unknown $f(x)$.

2.2.1 Gaussian Process Priors 

The Gaussian process provides a flexible distribution 
over random functions, the properties of which can 
be specified via a positive definite covariance function, 
without having to choose a particular finite basis. Typically,
Gaussian processes are defined in terms of a distribution
over scalar functions and, in keeping with the
convention for the GPLVM, we shall assume that $K$
independent GPs are used to construct the vector-valued
function $f(x)$. We denote each of these functions
as $f_k(x) : \mathcal{X} \to \mathbb{R}$. The GP requires a covariance kernel
function, which we denote as $C(x, x') : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
The defining characteristic of the GP is that for any
finite set of $N$ data in $\mathcal{X}$ there is a corresponding
$N$-dimensional Gaussian distribution over the function
values, which in the GPLVM we take to be the components
of $y$. The $N \times N$ covariance matrix of this
distribution is the matrix arising from the application
of the covariance kernel to the points in $\mathcal{X}$. We denote
any additional parameters governing the behavior
of the covariance function by $\theta$.
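
As a small illustration of the defining property above, the following sketch builds the $N \times N$ covariance matrix for a finite set of latent points and draws one vector of function values from the corresponding $N$-dimensional Gaussian. The RBF kernel, its hyperparameters (standing in for $\theta$), the jitter term and the function names are assumptions for the example.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Assumed covariance function C(x, x'): squared-exponential (RBF) kernel."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

rng = np.random.default_rng(0)
N, J = 6, 2
X = rng.standard_normal((N, J))   # N latent points in a J-dimensional latent space
Sigma = rbf_kernel(X, X)          # N x N covariance matrix of the function values

# One draw of f_k at the N points; a small jitter keeps the matrix numerically PSD.
f_sample = rng.multivariate_normal(np.zeros(N), Sigma + 1e-9 * np.eye(N))
```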

Under the component-wise independence assumptions
of the GPLVM, the Gaussian process prior allows one
to analytically integrate out the $K$ latent scalar functions
from $\mathcal{X}$ to $\mathcal{Y}$. Allowing for each of the $K$ Gaussian
processes to have unique hyperparameters $\theta_k$, we
write the marginal likelihood, i.e., the probability of
the observed data given the hyperparameters and the
latent representation, as

$$p(\mathcal{D} \mid \{x^{(n)}\}_{n=1}^N, \{\theta_k\}_{k=1}^K, \sigma^2) = \prod_{k=1}^K \mathcal{N}(y_k^{(\cdot)} \mid 0, \Sigma_{\theta_k} + \sigma^2 I_N), \qquad (2)$$

where $y_k^{(\cdot)}$ refers to the vector $[y_k^{(1)}, \ldots, y_k^{(N)}]$ and
where $\Sigma_{\theta_k}$ is the matrix arising from $\{x^{(n)}\}_{n=1}^N$ and $\theta_k$.
In the basic GPLVM, the optimal $x^{(n)}$ are found by
maximizing this marginal likelihood.



2.2.2 The Back-Constrained GPLVM 

Although the GPLVM enforces a smooth mapping 
from the latent representation to the observed data, 
the converse is not true: neighbors in observed space 
need not be neighbors in the latent representation. In 
many applications this can be an undesirable prop- 
erty. Furthermore, encoding novel datapoints into the 
latent space is not straightforward in the GPLVM; 
one must optimize the latent representations of
out-of-sample data using, e.g., conjugate gradient methods.
With these considerations in mind, Lawrence and
Quinonero-Candela (2006) reformulated the GPLVM
with the constraint that the hidden representation be
the result of a smooth map from the observed space.
Parameterized by $\phi$, this "encoder" function is denoted
as $g(y; \phi) : \mathcal{Y} \to \mathcal{X}$. The marginal likelihood objective
of this back-constrained GPLVM can now be formulated
as finding the optimal $\phi$ under:

$$\phi^\star = \arg\min_{\phi} \sum_{k=1}^K \left[ \ln\left| \Sigma_{\theta_k,\phi} + \sigma^2 I_N \right| + y_k^{(\cdot)\top} \left( \Sigma_{\theta_k,\phi} + \sigma^2 I_N \right)^{-1} y_k^{(\cdot)} \right], \qquad (3)$$

where the $k$th covariance matrix $\Sigma_{\theta_k,\phi}$ now depends
not only on the kernel hyperparameters $\theta_k$, but also
on the parameters of $g(y; \phi)$, i.e.,

$$[\Sigma_{\theta_k,\phi}]_{n,n'} = C(g(y^{(n)}; \phi), g(y^{(n')}; \phi); \theta_k). \qquad (4)$$

Lawrence and Quinonero-Candela (2006) explored
multilayer perceptrons and radial-basis-function networks
as possible smooth maps $g(y; \phi)$.
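
The back-constrained objective of Eqns. (3) and (4) can be evaluated as in the following sketch. It is illustrative only: the single tanh layer used as the smooth map, the RBF kernel, the shared hyperparameters across output dimensions, the fixed noise variance and all names are assumptions, not details taken from the cited work.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Assumed covariance C(x, x'; theta): RBF kernel with one shared lengthscale."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def encoder(Y, W):
    """Hypothetical smooth map g(y; phi): a single tanh layer with weights W."""
    return np.tanh(Y @ W)

def back_constrained_objective(Y, W, noise_var=0.1):
    """Eq. (3): sum over k of ln|S| + y_k^T S^{-1} y_k, where
    S = Sigma + noise_var * I is built from the encoded inputs as in Eq. (4)."""
    N, K = Y.shape
    S = rbf_kernel(encoder(Y, W)) + noise_var * np.eye(N)
    L = np.linalg.cholesky(S)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    S_inv_Y = np.linalg.solve(S, Y)        # S^{-1} y_k for all K columns at once
    # Hyperparameters are shared here, so the log-determinant is counted K times.
    return K * log_det + np.sum(Y * S_inv_Y)

rng = np.random.default_rng(1)
Y = rng.standard_normal((7, 3))            # N = 7 observations, K = 3 dimensions
W = rng.standard_normal((3, 2))            # encoder weights, J = 2 latent dimensions
value = back_constrained_objective(Y, W)
```

Gradients of this quantity with respect to the encoder weights, rather than with respect to free latent points, are what distinguish the back-constrained model from the basic GPLVM.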

2.3 GPLVM as an Infinite Autoencoder 

The relationship between Gaussian processes and artificial
neural networks is well-established. Neal (1996)
showed that the prior over functions implied by many
parametric neural networks becomes a GP in the limit
of an infinite number of hidden units, and Williams
(1998) subsequently derived a covariance function that
corresponds to such a network under a particular
activation function.

One overlooked consequence of this relationship is that
it also connects autoencoders and the back-constrained
Gaussian process latent variable model. Applying
the covariance function of Williams (1998) to the
GPLVM yields a density network
(MacKay, 1994) with an infinite number of hidden
units in the single hidden layer. Then, using a neural
network for the GPLVM back-constraints transforms
the density network into a semiparametric autoencoder,
where the encoder is a parametric neural
network and the decoder is a Gaussian process.

Alternatively, one can start from the autoencoder and 
notice that, for a linear decoder with a least-squares 
reconstruction cost and zero-mean Gaussian prior over 
its weights, it is possible to integrate out the decoder.
Learning then corresponds to the minimization
of Eqn. (3) with a linear kernel for Eqn. (4). Any
non-degenerate positive definite kernel corresponds to 
a decoder of infinite size, and also recovers the general 
back-constrained GPLVM algorithm. 
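
The linear-decoder argument can be written out as a short derivation. As assumptions for this sketch, let $X$ denote the $N \times J$ matrix whose $n$th row is the code $g(y^{(n)}; \phi)$, let $w_k$ be the decoder weights for output dimension $k$ with a unit-variance Gaussian prior, and ignore decoder biases.

```latex
\begin{align*}
y_k^{(\cdot)} &= X w_k + \varepsilon_k,
  \qquad w_k \sim \mathcal{N}(0, I_J),
  \quad \varepsilon_k \sim \mathcal{N}(0, \sigma^2 I_N), \\
\mathbb{E}\big[y_k^{(\cdot)}\big] &= 0, \\
\mathrm{Cov}\big[y_k^{(\cdot)}\big]
  &= X\,\mathbb{E}\big[w_k w_k^\top\big]\,X^\top + \sigma^2 I_N
   = X X^\top + \sigma^2 I_N, \\
p\big(y_k^{(\cdot)} \mid X\big)
  &= \mathcal{N}\big(y_k^{(\cdot)} \mid 0,\; X X^\top + \sigma^2 I_N\big), \\
[X X^\top]_{n,n'} &= g(y^{(n)}; \phi)^\top g(y^{(n')}; \phi).
\end{align*}
```

The last line is the linear kernel evaluated on the encoder outputs, so taking $-2 \ln$ of the product over $k$ recovers the objective of Eqn. (3) up to an additive constant.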

Such an infinite autoencoder exhibits some desirable 
properties. The infinite decoder network obviates the 
need to explicitly specify and learn a parametric form 
for the generally superfluous decoder network and 
rather marginalises over all possible decoders. This 
comes at the cost of having to invert as many matrices 
(the GP covariances) as there are input dimensions. 
Hence, for large input dimensionality, one could argue 
that the fully parametric autoencoder is preferable. 

3 Supervised Guiding of Latent
Representations 

As discussed earlier, when the salient variations in the
input are only weakly informative about a particular
discriminative task, it can be useful to incorporate label
information into unsupervised learning. Bengio
et al. (2007) showed, for example, that while a purely
supervised signal can lead to overfitting, mild supervised
guidance can be beneficial when initializing a
discriminative deep neural network. For that reason,
Bengio et al. (2007) proposed that latent representations
also be trained to predict the label information,
by adding a parametric mapping $c(x; \Lambda) : \mathcal{X} \to \mathcal{Z}$
from the latent representation's space $\mathcal{X}$ to the label
space $\mathcal{Z}$ and backpropagating error gradients from the
output to the representation. Bengio et al. (2007)
investigated the use of a linear logistic regression classifier
for the parametric mapping. Such "partial supervision"
would encourage discovery of a latent representation
that is useful to a specific (but learned)
parametrization of such a linear classifier. A similar
approach was used by Ranzato and Szummer (2008)
to learn compact representations of documents.

There are two disadvantages to this strategy. First,
the assumption of a specific parametric form for the
mapping $c(x; \Lambda)$ restricts the guidance to classifiers
within that family of mappings. The second is that
the learned representation is committed to one particular
setting of the parameters $\Lambda$. Consider the learning
dynamics of gradient descent optimization for this
strategy. At every iteration $t$ of descent (with current
state $\phi_t, \psi_t, \Lambda_t$), the gradient from supervised guidance
encourages the latent representation (currently
parametrized by $\phi_t, \psi_t$) to become more predictive of
the labels under the current label map $c(x; \Lambda_t)$. Such
behavior discourages moves in $\phi, \psi$ space that make
the latent representation more predictive under some
other label map $c(x; \Lambda^\star)$ where $\Lambda^\star$ is potentially distant
from $\Lambda_t$. Hence, while the problem would seem to
be alleviated by the fact that $\Lambda$ is learned jointly, this
constant pressure towards representations that are immediately
useful should increase the difficulty of representation
learning.

3.1 Nonparametrically Guided Autoencoder

Rather than directly specifying a particular discriminative
regressor for guiding the latent representation,
it seems more desirable to simply ensure that such a
function exists. That is, we would prefer not to have to
choose a latent representation that is tied to a specific
map to labels, but instead find representations that are
consistent with many such maps. One way to arrive
at such a guidance mechanism is to marginalize out
the parameters $\Lambda$ of a label map $c(x; \Lambda)$ under a distribution
that permits a wide family of functions. We
have seen previously that this can be done for reconstructions
of the input space with a decoder $f(x; \psi)$.
We follow the same reasoning and do this instead
for $c(x; \Lambda)$. Integrating out the parameters of the
label map yields a back-constrained GPLVM acting
on the label space $\mathcal{Z}$, where the back constraints are
determined by the input space $\mathcal{Y}$. The positive definite
kernel specifying the Gaussian process then determines
the properties of the distribution over mappings
from the latent representation to the labels. The result
is a hybrid of the autoencoder and back-constrained
GPLVM, where the encoder is shared across models.
For notation, we will refer to this approach to guided
latent representation as a nonparametrically guided
autoencoder, or NPGA.

Let the label space $\mathcal{Z}$ be an $M$-dimensional real space,^1
i.e., $\mathcal{Z} = \mathbb{R}^M$, and let the $n$th training example have a label
vector $z^{(n)} \in \mathcal{Z}$. The covariance function that relates
label vectors in the NPGA is

$$[\Sigma_{\theta_m,\phi,\Gamma}]_{n,n'} = C(\Gamma \cdot g(y^{(n)}; \phi), \Gamma \cdot g(y^{(n')}; \phi); \theta_m),$$

where $\Gamma \in \mathbb{R}^{H \times J}$ is an $H$-dimensional linear projection
of the encoder output. For $H \ll J$, this projection
improves efficiency and reduces overfitting. Learning
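in the NPGA is then formulated as finding the optimal
$\phi, \psi, \Gamma$ under the combined objective:

$$\phi^\star, \psi^\star, \Gamma^\star = \arg\min_{\phi, \psi, \Gamma} \; (1-\alpha) L_{\mathrm{auto}}(\phi, \psi) + \alpha L_{\mathrm{GP}}(\phi, \Gamma),$$

where $\alpha \in [0, 1]$ linearly blends the two objectives

$$L_{\mathrm{auto}}(\phi, \psi) = \sum_{n=1}^N \sum_{k=1}^K \left( y_k^{(n)} - f_k(g(y^{(n)}; \phi); \psi) \right)^2,$$

$$L_{\mathrm{GP}}(\phi, \Gamma) = \sum_{m=1}^M \left[ \ln\left| \Sigma_{\theta_m,\phi,\Gamma} + \sigma^2 I_N \right| + z_m^{(\cdot)\top} \left( \Sigma_{\theta_m,\phi,\Gamma} + \sigma^2 I_N \right)^{-1} z_m^{(\cdot)} \right].$$

We use a linear decoder for $f(x; \psi)$, and the encoder
$g(y; \phi)$ is a linear transformation followed by a fixed
element-wise nonlinearity.

As is common for autoencoders and to reduce the
number of free parameters in the model, the encoder
and decoder weights are tied. For the larger NORB
dataset, we divide the training data into mini-batches
of 350 training cases and perform three iterations of
conjugate gradient descent per mini-batch. Finally, as
proposed in the denoising autoencoder variant of Vincent
et al. (2008), we always add noise to the encoder
inputs in cost $L_{\mathrm{auto}}(\phi, \psi)$, keeping the noise fixed during
each iteration.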
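
The combined NPGA objective can be sketched as follows. This is a hedged illustration rather than the authors' code: the rectifier nonlinearity, the additive Gaussian corruption, the unit-lengthscale RBF kernel, the fixed noise variance, the unnormalized sums and every name below are assumptions. Tied weights and corruption applied only in the autoencoder cost follow the description above.

```python
import numpy as np

def gp_cost(codes, Z, noise_var=0.1):
    """L_GP: sum over label dims m of ln|S| + z_m^T S^{-1} z_m, with
    S = C(codes, codes) + noise_var * I (assumed RBF kernel, unit lengthscale)."""
    sq = (np.sum(codes**2, axis=1)[:, None] + np.sum(codes**2, axis=1)[None, :]
          - 2.0 * codes @ codes.T)
    S = np.exp(-0.5 * np.maximum(sq, 0.0)) + noise_var * np.eye(len(codes))
    _, log_det = np.linalg.slogdet(S)
    return Z.shape[1] * log_det + np.sum(Z * np.linalg.solve(S, Z))

def npga_objective(Y, Z, W, b, c, Gamma, alpha, noise_std, rng):
    """(1 - alpha) * L_auto + alpha * L_GP with tied encoder/decoder weights W."""
    Y_noisy = Y + noise_std * rng.standard_normal(Y.shape)  # corruption for L_auto only
    recon = np.maximum(0.0, Y_noisy @ W + b) @ W.T + c      # tied-weight linear decoder
    l_auto = np.sum((Y - recon) ** 2)
    X = np.maximum(0.0, Y @ W + b)                          # codes g(y; phi)
    l_gp = gp_cost(X @ Gamma.T, Z)                          # GP acts on Gamma * g(y; phi)
    return (1.0 - alpha) * l_auto + alpha * l_gp

rng = np.random.default_rng(0)
N, K, J, H, M = 8, 5, 6, 2, 3
Y = rng.standard_normal((N, K))
Z = np.eye(M)[rng.integers(0, M, size=N)]                   # one-hot label vectors
W, b, c = 0.1 * rng.standard_normal((K, J)), np.zeros(J), np.zeros(K)
Gamma = rng.standard_normal((H, J))
loss = npga_objective(Y, Z, W, b, c, Gamma, alpha=0.5, noise_std=0.05, rng=rng)
```

Note that no label-map parameters appear anywhere in the function: the labels enter only through the Gaussian marginal likelihood, which is the sense in which the label map has been integrated out.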

^1 For discrete labels, we use a "one-hot" encoding.

3.2 Related Models

The combination of parametric unsupervised learning
and nonparametric supervised learning has been examined
previously. Salakhutdinov and Hinton (2007)
proposed merging autoencoder training with nonlinear
neighborhood component analysis, which encourages
the encoder to have similar outputs for similar
inputs belonging to the same class. Note that the
back-constrained GPLVM performs a similar role.
Examining Equation (3), one can see that the first term,
the log determinant of the kernel, regularizes the latent
space. It pulls all examples together as the determinant
is minimized when the covariance between
all pairs is maximized. The second term is a data fit
term, pushing examples that are distant in label space
apart in the latent space. In the case of a one-hot
coding, the labels act as indicator variables including
only indices of the concentration matrix that reflect
inter-class pairs in the loss. Thus the GPLVM enforces
that examples close in the label space will be closer
in the latent space than examples that are distant
in label space. There are several notable differences,
however, between this work and the NPGA. First, as
the NPGA is a natural generalization of the
back-constrained GPLVM, it can be intuitively interpreted
as a marginalization of label maps, as discussed in the
previous section. Second, the NPGA enables the wide
library of covariance functions from the Gaussian process
literature to be incorporated into the framework of
learning guided representation and naturally accommodates
continuous labels. Finally, as will be discussed
in Section 4.2, the NPGA not only enables learning of
unsupervised features that capture discriminatively-relevant
information, but also allows representations
that can ignore irrelevant information.

Previous work has also hybridized Gaussian processes
and unsupervised connectionist learning. In Salakhutdinov
and Hinton (2008), restricted Boltzmann machines
were used to initialize a neural network that
would provide features to a Gaussian process regressor
or classifier. Unlike the NPGA, however, this approach
does not address the issue of guided unsupervised
representation. Indeed, in NPGA, Gaussian processes are
used only for representation learning, are applied only
on small mini-batches and are not required at test
time. This is important, since deploying a Gaussian
process on large datasets such as the NORB data poses
significant practical problems. Because their method
relies on a Gaussian process at test time, a direct
application of the approach proposed by Salakhutdinov
and Hinton (2008) would be prohibitively slow.

Although the GPLVM was originally proposed as a
latent variable model conditioned on the data, there
has been work on adding discriminative label information
and additional signals. The Discriminative
GPLVM (DGPLVM) (Urtasun and Darrell 2007) incorporates
discriminative class labels through a prior
based on discriminant analysis that enforces separability
between classes in the latent space. The DGPLVM
is, however, restricted to discrete labels and
requires a GP mapping to the data, which is computationally
prohibitive for high dimensional data. Shon
et al. (2005) introduced a Shared-GPLVM (SGPLVM)
that used multiple GPs to map from a single shared
latent space to various related signals. Wang et al.
(2007) demonstrate that a generalisation of multilinear
models arises as a GPLVM with product kernels, each
mapping to different signals. This allows one to separate
various signals in the data within the context of
the GPLVM. Again, due to the Gaussian process mapping
to the data, the shared and multifactor GPLVM
are not feasible on high dimensional data. Our model
overcomes the limitations of these through using a natural
parametric form of the GPLVM, the autoencoder,
to map to the data.

4 Empirical Analyses 

We now present experiments with NPGA on two different
classification datasets. Our implementation of
NPGA is available for download at
http://removed.for.anonymity.org. In all experiments, the
discriminative value of the learned representation is evaluated
by training a linear (logistic) classifier, a standard
practice for evaluating latent representations.

4.1 Oil Flow Data

As an initial empirical analysis we consider a multi-phase
oil flow classification problem (Bishop and
James 1993). The data are twelve-dimensional,
real-valued gamma densitometry measurements
from a simulation of multi-phase oil flow.
The classification task is to determine from which of
three phase configurations each example originates.
There are 1,000 training and 1,000 test examples. The
relatively small size of these training data makes them
useful for empirical evaluation of different models and
training procedures. We use these data primarily to
address two concerns:

• To what extent does the nonparametric guidance 
of an unsupervised parametric autoencoder im- 
prove the learned feature representation with re- 
spect to the classification objective? 

• What additional benefit is gained through using 
nonparametric guidance over simply incorporat- 
ing a parametric mapping to the labels? 

Figure 1: The effect of scaling the relative contributions of the autoencoder, logistic regressor and GP costs in the
hybrid objective by modifying $\alpha$ and $\beta$. (a) Classification error on the test set on a linear scale from 6% (dark
red) to 1% (dark blue). (b) Cross-sections of (a) at $\beta = 0$ (a fully parametric model) and $\beta = 1$ (NPGA).
(c & d) Latent projections of the 1000 test cases within the latent space of the GP for an NPGA ($\alpha = 0.5$) and a
back-constrained GPLVM.



To address these questions, we construct a new objective
that linearly blends our proposed supervised guidance
cost $L_{\mathrm{GP}}(\phi, \Gamma)$ with the one proposed by Bengio
et al. (2007), referred to as $L_{\mathrm{LR}}(\phi, \Lambda)$:

$$L(\phi, \psi, \Lambda, \Gamma; \alpha, \beta) = (1-\alpha) L_{\mathrm{auto}}(\phi, \psi) + \alpha \left( (1-\beta) L_{\mathrm{LR}}(\phi, \Lambda) + \beta L_{\mathrm{GP}}(\phi, \Gamma) \right),$$

where $\beta \in [0, 1]$ and $\Lambda$ are the parameters of a multi-class
logistic regressor that maps to the labels. Thus, $\alpha$
controls the relative importance of supervised guidance,
while $\beta$ controls the relative importance of the
parametric and nonparametric supervised guidance.

A grid search over $\alpha$ and $\beta$ was performed at intervals
of 0.1 to assess the benefit of the nonparametric guidance.
At each interval a model was trained for 100 iterations
and classification performance was assessed via
logistic regression on the hidden units of the encoder.
Notice how the cost $L_{\mathrm{LR}}(\phi, \Lambda)$ is specifically tailored
to this situation. The encoder used 250 noisy rectified
linear (NReLU; Nair and Hinton 2010) units, and
zero-mean Gaussian noise with a standard deviation of
0.05 was added to the inputs of the autoencoder cost.
A subset of 100 training samples was used to make
the problem more challenging. Each experiment was
repeated 20 times with random initializations. The
GP label mapping used an RBF kernel and worked on
a projected space of dimension $H = 2$.
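
The blending of the three costs and the grid of settings can be made concrete with a short sketch. The function and variable names are hypothetical, and the three cost values are passed in as plain numbers; computing them is outside the scope of this example.

```python
def blended_objective(l_auto, l_lr, l_gp, alpha, beta):
    """L = (1 - alpha) * L_auto + alpha * ((1 - beta) * L_LR + beta * L_GP)."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    return (1.0 - alpha) * l_auto + alpha * ((1.0 - beta) * l_lr + beta * l_gp)

# Grid over alpha and beta at intervals of 0.1: 11 x 11 = 121 settings.
grid = [(a / 10, b / 10) for a in range(11) for b in range(11)]

# Special cases: alpha = 0 is a plain autoencoder; with alpha > 0, beta = 0 is purely
# parametric (logistic regression) guidance and beta = 1 is the NPGA.
plain = blended_objective(2.0, 3.0, 5.0, alpha=0.0, beta=0.7)
parametric = blended_objective(2.0, 3.0, 5.0, alpha=0.5, beta=0.0)
npga = blended_objective(2.0, 3.0, 5.0, alpha=0.5, beta=1.0)
```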



Results are presented in Fig. 1. Fig. 1b demonstrates
that performance improves by integrating out the label
map, even when compared with direct optimization
under the discriminative family that will be used
at test time. Figs. 1c and 1d provide a visualisation of
the latent representation learned by NPGA and a standard
back-constrained GPLVM. We see that the former
embeds much more class-relevant structure than
the latter.



An interesting observation is that a simple linear kernel
also tends to outperform parametric guidance (see
Fig. 1b). This doesn't mean that any kernel will work
for any problem. However, this confirms that the benefit
of our approach is achieved mainly through integrating
out the label mapping, rather than having a
more powerful nonlinear mapping to the label.

4.2 Small NORB Image Data 

As a second empirical analysis, the NPGA is evaluated
on a challenging dataset with multiple discrete
and real-valued labels. The small NORB data (LeCun
et al. 2004) are stereo image pairs of fifty toys belonging
to five generic categories. Each object was imaged
under six lighting conditions, nine elevations and eighteen
azimuths. The objects were divided evenly into
test and training sets yielding 24,300 examples each.

The variations in the data resulting from the different 
imaging conditions impose significant nuisance struc- 
ture that will invariably be learned by a standard au- 
toencoder. Fortunately, these variations are known a 
priori. In addition to the class labels, there are two 
real-valued vectors (elevation and azimuth) and one 
discrete vector (lighting type) associated with each im- 
age. In our empirical analysis we examine two ques- 
tions: 

• As the autoencoder attempts to coalesce the vari- 
ous sources of structure into its hidden layer, can 
the NPGA guide the learning in such a way as 
to separate the class-invariant transformations of 
the data from the class-relevant information? 

• Are the benefits of nonparametric guidance still 
observed in a larger scale classification problem, 
when mini-batch training is used? 

Figure 2: Visualisations of the NORB training (left)
and test (right) data latent space representations in
the NPGA, corresponding to class (first row), elevation
(second row), and lighting (third row). Colors
correspond to class labels.

To address these questions, an NPGA was employed with
GPs mapping to each of the four labels. Each GP was
applied to a unique partition of the hidden units of
an autoencoder with 2400 NReLU units. A GP mapping
to the class labels was applied to half of the hidden
units and operated on an $H = 4$ dimensional latent
space. The remaining 1200 units were divided evenly
among GPs mapping to the three auxiliary labels. As
the lighting labels are discrete, they were treated similarly
to the class labels, with $H = 2$. The elevation
labels are continuous, so the GP was mapped directly
to the labels, with $H = 2$. Finally, as the azimuth is
a periodic signal, a periodic kernel was used for the
azimuth GP, with $H = 1$. This illustrates a major advantage
of our approach, as the GP provides flexibility
that would be challenging with a parametric mapping.
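
The partitioning of hidden units and the periodic covariance can be sketched as below. The unit counts follow the text (1200 units for the class GP, and the remaining 1200 split evenly, so 400 each, among the three auxiliary GPs). The ordering of the partitions, the specific periodic kernel form (the standard exp-sine-squared kernel), the period of 360 degrees and all names are assumptions for illustration.

```python
import numpy as np

n_hidden = 2400
sizes = {"class": 1200, "lighting": 400, "elevation": 400, "azimuth": 400}
proj_dims = {"class": 4, "lighting": 2, "elevation": 2, "azimuth": 1}  # H per GP

# Contiguous, non-overlapping index ranges over the hidden units (assumed ordering).
bounds = np.cumsum([0] + list(sizes.values()))
partition = {name: slice(int(lo), int(hi))
             for name, lo, hi in zip(sizes, bounds[:-1], bounds[1:])}

def periodic_kernel(a, b, period=360.0, lengthscale=1.0):
    """Assumed periodic covariance: exp(-2 sin^2(pi (a - b) / period) / lengthscale^2)."""
    d = a[:, None] - b[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

angles = np.array([0.0, 360.0, 180.0])
K_az = periodic_kernel(angles, angles)
```

Under such a kernel, inputs one full period apart are treated as identical, which is the behaviour one would want for a rotation angle and which is awkward to express with a linear or logistic output layer.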

This configuration was compared to an autoencoder 
(a — O), an autoencoder with parametric logistic re- 
gression guidance and a similar NPGA where only a 
GP to classes was applied to all the hidden units. A 
back-constrained GPLVM and SGPLVM were also ap- 
plied to these data for comparisorj^ The results]^ are 

^The GPLVM and SGPLVM were applied to a 96 di- 
mensional PCA of the data for computional tractability, 
used a neural net covariance mapping to the data, and 
otherwise used the same back-constraints, kernel configu- 
ration, and minibatch training as the NPGA. 

^A validation set of 4300 training cases was withheld 



Model Accuracy 

Autoencoder + 4(Log)reg {a = 0.5) 85.97% 

GPLVM 88.44% 

SGPLVM (4 GPs) 89.02% 

NPGA (4 GPs Lin - a = 0.5) 92.09% 

Autoencoder 92.75% 

Autoencoder -I- Logreg {a = 0.5) 92.91% 

NPGA (1 GP NN - a = 0.5) 93.03% 

NPGA (1 GP Lin ~ a = 0.5) 93.12% 

NPGA (4 GPs Mix - a = 0.5) 94.28% 



K-Nearest Neighbors 



( LeCun et al 



Gaussian SV 



I! 



20041 



( Salakhutdinov and Larochelle 20101 
3 Layer DBN 



(Salakhutdinov and Larochelle 20101 
DBMTMF-FULE 



(Salakhutdinov and Larochelle 20101 
Third Order REM 



83.4% 
88.4% 
91.69% 
92.77% 
93.5% 



(Nair and Hinton 



2009) 



Table 1: Experimental results on the small NORB 
data test set. Relevant published results are shown 
for comparison. NN, Lin and Mix indicate neural net- 
work, linear and a combination of neural network and 
periodic covariances respectively. 

reported in Table [T] A visualisation of the structure 
learned by the GPs is shown in Figure [2] 

The model with 4 GPs with nonlinear kernels obtains 
an accuracy of 94.28% and significantly outperforms 
all other models, achieving to our knowledge the best 
(non-convolutional) results for a shallow model on this 
dataset. Applying nonparametric guidance to all four
of the signals appears to separate the class-relevant
information from the irrelevant transformations in the
data. Indeed, a logistic regression classifier trained
only on the 1200 hidden units to which the class GP
was applied achieves a test accuracy of 94.02%, implying
that half of the latent representation can be discarded
with virtually no discriminative penalty.

One interesting observation is that, for linear kernels, 
guidance with respect to all labels decreases the per- 
formance compared to using guidance only from the 
class label (from 93.03% down to 92.09%). An au- 
toencoder with parametric guidance to all four labels 



for parameter selection and early stopping. Neural net co- 
variances with fixed hyperparameters were used for each 
GP, except for the GP on the rotation label, which used a 
periodic kernel. The raw pixels were corrupted by setting 
the value of 20% of the pixels to zero for denoising au- 
toencoder training. Each image was lighting and contrast 
normalized. The error on the test set was evaluated using 
logistic regression on the hidden units of each model. 
was tested as well, mimicking the configuration of the
NPGA, with two logistic and two Gaussian outputs operating
on separate partitions of the hidden units. This
model achieved only 86% accuracy. These observations 
highlight the advantage of the GP formulation for su- 
pervised guidance, which gives the flexibility of choos- 
ing an appropriate kernel for different label mappings 
(e.g. a periodic kernel for the rotation label). 
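
The effect of kernel choice can be illustrated with a small NumPy sketch. It is a toy setup rather than the NORB experiment: a hypothetical one-dimensional latent coordinate h and a label that varies periodically with it. The function names, noise level, lengthscale and period are assumptions made for illustration, and the standard exp-sine-squared form is used for the periodic covariance. The sketch compares the GP negative log marginal likelihood of the labels under a linear and a periodic covariance.

```python
import numpy as np

def linear_kernel(h_a, h_b):
    """Linear covariance between two sets of scalar latent values."""
    return np.outer(h_a, h_b)

def periodic_kernel(h_a, h_b, period=2.0 * np.pi, lengthscale=1.0):
    """Standard periodic (exp-sine-squared) covariance on scalar inputs."""
    diff = h_a[:, None] - h_b[None, :]
    return np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / lengthscale ** 2)

def gp_neg_log_marginal(K, y, noise=1e-2):
    """Negative log marginal likelihood of y under a zero-mean GP prior."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise * np.eye(n))        # K + noise*I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + noise*I)^{-1} y
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))                 # 0.5 * log det
            + 0.5 * n * np.log(2.0 * np.pi))

# Toy data (assumed): the label wraps around every 2*pi in latent space.
h = np.linspace(0.0, 4.0 * np.pi, 40)
y = np.sin(h)

nll_linear = gp_neg_log_marginal(linear_kernel(h, h), y)
nll_periodic = gp_neg_log_marginal(periodic_kernel(h, h), y)
print("linear kernel NLL:  ", nll_linear)
print("periodic kernel NLL:", nll_periodic)
```

On this toy problem the periodic covariance explains the labels far better than the linear one, which is the kind of mismatch a fixed parametric output layer cannot correct by a change of kernel.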

5 Conclusion 

In this paper we observe that the back-constrained 
GPLVM can be interpreted as the infinite limit of 
a particular kind of autoencoder. This relationship 
enables one to learn the encoder half of an autoen- 
coder while marginalizing over decoders. We use this 
theoretical connection to marginalize over functional 
mappings from the latent space of the autoencoder to 
any auxiliary label information. The resulting non- 
parametric guidance encourages the autoencoder to 
encode a latent representation that captures salient 
structure within the input data that is harmonious 
with the labels. Specifically, it enforces the require- 
ment that a smooth mapping exists from the hidden 
units to the auxiliary labels, without choosing a par- 
ticular parameterization. By applying the approach 
to two data sets, we show that the resulting non- 
parametrically guided autoencoder improves the latent 
representation of an autoencoder with respect to the 
discriminative task. Finally, we demonstrate on the 
NORB data that this model can also be used to dis- 
courage latent representations that capture statistical 
structure that is known to be irrelevant through guid- 
ing the autoencoder to separate the various sources of 
variation. This achieves state-of-the-art performance 
for a shallow non-convolutional model on NORB. 
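
As a rough illustration of the guidance idea summarized above, the following NumPy sketch evaluates a guided objective for a tiny autoencoder. It rests on several assumptions that are not taken from the experiments: the objective is written as a convex combination of reconstruction error and GP negative log marginal likelihood weighted by α, a squared-exponential covariance stands in for the neural network covariance, and the weights, sizes and data are random placeholders. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbf_kernel(H, lengthscale=1.0):
    """Squared-exponential covariance on hidden activations (a stand-in kernel)."""
    sq = np.sum(H ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * H @ H.T, 0.0)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_neg_log_marginal(K, y, noise=1e-2):
    """Negative log marginal likelihood of y under a zero-mean GP prior."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise * np.eye(n))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ a + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)

def guided_loss(X, y, W_enc, W_dec, alpha=0.5):
    """Blend reconstruction error with GP guidance on the hidden units.

    alpha = 0 gives a plain autoencoder loss; alpha = 1 uses guidance only.
    No parametric predictor of y is instantiated: the GP term scores how well
    some smooth function of the hidden units could explain the labels.
    """
    H = sigmoid(X @ W_enc)                       # encoder
    X_hat = H @ W_dec                            # linear decoder
    recon = 0.5 * np.mean(np.sum((X - X_hat) ** 2, axis=1))
    guide = gp_neg_log_marginal(rbf_kernel(H), y) / y.shape[0]
    total = (1.0 - alpha) * recon + alpha * guide
    return total, recon, guide

# Placeholder data and weights (assumed sizes, random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = rng.normal(size=30)
W_enc = 0.1 * rng.normal(size=(8, 4))
W_dec = 0.1 * rng.normal(size=(4, 8))

total, recon, guide = guided_loss(X, y, W_enc, W_dec, alpha=0.5)
print("total:", total, "reconstruction:", recon, "guidance:", guide)
```

In practice the loss would be minimized with respect to the encoder and decoder weights (and any kernel hyperparameters) by a gradient-based optimizer; the sketch only evaluates it once.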

References 

R. P. Adams, Z. Ghahramani, and M. I. Jordan. Tree- 
structured stick breaking for hierarchical data. In Neural 
Information Processing Systems, 2010. 

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. 
Greedy layer-wise training of deep networks. In Neural
Information Processing Systems, pages 153-160, 2007. 

C. M. Bishop and G. D. James. Analysis of multiphase 
flows using dual-energy gamma densitometry and neural
networks. Nuclear Instruments and Methods in Physics 
Research, pages 580-593, 1993. 

G. W. Cottrell, P. Munro, and D. Zipser. Learning internal 
representations from gray-scale images: An example of 
extensional programming. In Conference of the Cogni- 
tive Science Society, pages 462-473, 1987. 

L. Deng, M. Seltzer, D. Yu, A. Acero, A.-R. Mohamed, 
and G. E. Hinton. Binary coding of speech spectrograms 
using a deep autoencoder. In Interspeech, 2010. 



H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and
Y. Bengio. An empirical evaluation of deep architec- 
tures on problems with many factors of variation. In 
International Conference on Machine Learning, 2007. 

N. D. Lawrence. Probabilistic non-linear principal com- 
ponent analysis with Gaussian process latent variable 
models. Journal of Machine Learning Research, 6:1783- 
1816, 2005. 

N. D. Lawrence and J. Quiñonero-Candela. Local distance
preservation in the GP-LVM through back constraints.
In International Conference on Machine Learning, 2006.

N. D. Lawrence and R. Urtasun. Non-linear matrix factor- 
ization with Gaussian processes. In International Con- 
ference on Machine Learning, 2009. 

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods 
for generic object recognition with invariance to pose
and lighting. Computer Vision and Pattern Recognition, 
2004. 

D. J. MacKay. Bayesian neural networks and density net- 
works. In Nuclear Instruments and Methods in Physics 
Research, A, pages 73-80, 1994. 

V. Nair and G. E. Hinton. 3d object recognition with deep 
belief nets. In Neural Information Processing Systems, 
2009. 

V. Nair and G. E. Hinton. Rectified linear units improve 
restricted Boltzmann machines. In International Con- 
ference on Machine Learning, 2010. 

R. Neal. Bayesian learning for neural networks. Lecture
Notes in Statistics, 118, 1996.

M. Ranzato and M. Szummer. Semi-supervised learning of
compact document representations with deep networks. 
In International Conference on Machine Learning, 2008. 

R. Salakhutdinov and G. Hinton. Learning a nonlinear 
embedding by preserving class neighbourhood structure. 
In Artificial Intelligence and Statistics, 2007. 

R. Salakhutdinov and G. Hinton. Using deep belief nets 
to learn covariance kernels for Gaussian processes. In
Neural Information Processing Systems, 2008. 

R. Salakhutdinov and H. Larochelle. Efficient learning of
deep Boltzmann machines. In Artificial Intelligence and 
Statistics, 2010. 

A. P. Shon, K. Grochow, A. Hertzmann, and R. P. N. Rao.
Learning shared latent structure for image synthesis and 
robotic imitation. In Neural Information Processing Sys- 
tems, 2005. 

R. Urtasun and T. Darrell. Discriminative Gaussian pro- 
cess latent variable model for classification. In Interna- 
tional Conference on Machine Learning, 2007. 

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. 
Extracting and composing robust features with denois- 
ing autoencoders. In International Conference on Ma- 
chine Learning, 2008. 

J. M. Wang, D. J. Fleet, and A. Hertzmann. Multifactor 
Gaussian process models for style-content separation. In 
International Conference on Machine Learning, volume 
227, 2007. 

J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process
dynamical models for human motion. IEEE PAMI,
30(2):283-298, 2008. 

C. K. I. Williams. Computation with infinite neural net- 
works. Neural Computation, 10(5):1203-1216, 1998. 



